Analysing Gene Expression of Breast Cancer Patients

Iben Sommerand s203522
Jonas Sennek s203516
Emilie Wenner s193602
Torbjørn Bak Regueira s203555
Vedis Arntzen s203546

Introduction

  • 2 296 840 new breast cancer patients in 2022 [1].

  • Aim of project: Exploring and analysing patterns in breast cancer gene expression data.

Materials and methods

  • The analysis was performed on the dataset “GDC TCGA Breast Cancer (BRCA)” from xenabrowser.net

  • Our data:

    • Gene expression (RNAseq) and phenotype metadata


Methods: Data preparation

  • Data obtained programmatically

  • Pivoted the gene expression dataset to long format to make it tidy

  • The two datasets were joined on the patient IDs

  • Mutated the dataset to add new columns:

    • Age groups
    • Converted days to years for several relevant columns
  • Analytical methods:

    • Descriptive data analysis, principal component analysis (PCA) and linear modelling
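
The preparation steps above (pivot longer, join on patient ID, add derived columns) can be sketched in plain Python. This is only an illustration under stated assumptions: the column names (`gene`, `sample_id`, `expression`, `age_at_diagnosis_days`), the ten-year age bins, the 365.25-day year and the example values are all hypothetical and are not taken from the project's own code or data.

```python
DAYS_PER_YEAR = 365.25  # assumed conversion factor


def pivot_longer(wide_rows, id_col="gene"):
    """Turn one row per gene (one column per sample) into one row per (gene, sample)."""
    long_rows = []
    for row in wide_rows:
        for sample_id, value in row.items():
            if sample_id == id_col:
                continue
            long_rows.append(
                {"gene": row[id_col], "sample_id": sample_id, "expression": value}
            )
    return long_rows


def age_group(age_years, width=10):
    """Label an age with its bin, e.g. 53.2 -> '50-59' (bin width is an assumption)."""
    lower = int(age_years // width) * width
    return f"{lower}-{lower + width - 1}"


def add_derived_columns(pheno):
    """Add an age in years and an age group to one phenotype record."""
    out = dict(pheno)
    out["age_years"] = pheno["age_at_diagnosis_days"] / DAYS_PER_YEAR
    out["age_group"] = age_group(out["age_years"])
    return out


def join_on_sample(long_rows, phenotypes):
    """Inner join: keep expression rows whose sample ID has phenotype metadata."""
    by_id = {p["sample_id"]: p for p in phenotypes}
    return [
        {**row, **by_id[row["sample_id"]]}
        for row in long_rows
        if row["sample_id"] in by_id
    ]


# Hypothetical toy data, not values from the TCGA BRCA dataset.
expression_wide = [
    {"gene": "GENE_A", "S1": 2.0, "S2": 3.5},
    {"gene": "GENE_B", "S1": 0.0, "S2": 1.2},
]
phenotype = [
    {"sample_id": "S1", "age_at_diagnosis_days": 18263},
    {"sample_id": "S2", "age_at_diagnosis_days": 25000},
]

tidy = join_on_sample(
    pivot_longer(expression_wide),
    [add_derived_columns(p) for p in phenotype],
)
```

Each row of `tidy` then holds one gene, one sample, its expression value and the sample's metadata, which is the one-observation-per-row shape that the descriptive plots and models work from.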


Methods:

Figure 1: Flowchart of the workflow from raw data to results (to be inserted)

Descriptive analysis: Overview of the data

Figure 2: Gender and ethnicity distribution within the data

Figure 3: Cancer stage distribution within the data

Descriptive analysis: Vital status

Figure 4: Vital status by cancer type

Figure 5: Vital status by age

Descriptive analysis: Survival time and prior malignancy

Figure 6: Survival time by prior malignancy

Analysis: Linear model

Figure 7: Linear model (plot to be inserted)
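
The slides do not state which variables the linear model used, so the following is only a minimal sketch of a simple least-squares fit with one predictor and one response. The function name `fit_simple_ols` and the toy numbers are hypothetical and stand in for whatever predictor and response were actually modelled.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x and sum of cross-deviations of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical toy data: a predictor (e.g. age in years) and a response
# (e.g. the expression of one gene). Not values from the project.
predictor = [40, 50, 60, 70]
response = [5.0, 4.5, 4.0, 3.5]

slope, intercept = fit_simple_ols(predictor, response)
```

A negative slope in such a fit would indicate that the response decreases as the predictor increases; whether that relationship is meaningful still depends on its uncertainty, which this sketch does not estimate.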

Analysis: Investigating cancer stages

Figure 8: Survival time by cancer stage

Figure 9: Vital status by cancer stage

Analysis: PCA

Figure 10: Principal component analysis
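
To illustrate what the PCA computes, here is a small pure-Python sketch that finds the first principal component by power iteration on the sample covariance matrix. This is a stand-in technique chosen to keep the example dependency-free; the slides do not say how the PCA was actually run, and a real analysis would normally call a library routine. The function name and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores, explained_ratio) for the first principal component.

    rows: list of samples, each a list of feature values (e.g. gene expression).
    """
    n, p = len(rows), len(rows[0])
    # Centre every feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    total_variance = sum(cov[i][i] for i in range(p))
    if total_variance == 0:
        raise ValueError("data has no variance")
    # Power iteration: repeatedly multiply by the covariance matrix and normalise.
    # Note: this fails if the start vector is exactly orthogonal to the top component.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Variance along v (its eigenvalue) as a share of the total variance.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    # Project each centred sample onto the component to get its score.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores, eigenvalue / total_variance


# Hypothetical toy data: four samples, two features lying exactly on y = 2x.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores, explained = first_principal_component(samples)
```

Because the toy points lie on a single line, the first component captures all of the variance; in real expression data the share explained by each component is what a scree plot would show.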

Discussion:

  • Detecting the cancer at an early stage appears to be associated with a higher chance of survival

  • Limitations and future work

    • Compare against healthy tissue samples (e.g. GTEx)